
    HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces

    Nearest neighbor searching of large databases in high-dimensional spaces is inherently difficult due to the curse of dimensionality. A flavor of approximation is, therefore, necessary to practically solve the problem of nearest neighbor search. In this paper, we propose a novel yet simple indexing scheme, HD-Index, to solve the problem of approximate k-nearest neighbor queries in massive high-dimensional databases. HD-Index consists of a set of novel hierarchical structures called RDB-trees built on Hilbert keys of database objects. The leaves of the RDB-trees store distances of database objects to reference objects, thereby allowing efficient pruning using distance filters. In addition to triangular inequality, we also use Ptolemaic inequality to produce better lower bounds. Experiments on massive (up to billion scale) high-dimensional (up to 1000+) datasets show that HD-Index is effective, efficient, and scalable. Comment: PVLDB 11(8):906-919, 201
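
The distance filters mentioned in this abstract can be illustrated with a minimal sketch. It is not the HD-Index implementation: it only shows the two standard lower bounds on a query-to-object distance that follow from the triangle inequality (one reference object) and from Ptolemy's inequality (two reference objects) under the Euclidean metric. The function names and sample points are hypothetical and chosen for illustration.

```python
import math

def triangle_lower_bound(d_qp, d_op):
    """Lower bound on d(q, o) from one reference object p:
    d(q, o) >= |d(q, p) - d(o, p)|."""
    return abs(d_qp - d_op)

def ptolemaic_lower_bound(d_qp1, d_qp2, d_op1, d_op2, d_p1p2):
    """Lower bound on d(q, o) from two reference objects p1, p2
    (valid for Euclidean distance):
    d(q, o) >= |d(q, p1) * d(o, p2) - d(q, p2) * d(o, p1)| / d(p1, p2)."""
    return abs(d_qp1 * d_op2 - d_qp2 * d_op1) / d_p1p2

# Hypothetical query, database object, and two reference objects.
q, o = (0.0, 0.0), (3.0, 4.0)
p1, p2 = (1.0, 0.0), (0.0, 2.0)

true_dist = math.dist(q, o)

# Best triangle bound over the two reference objects.
tri_lb = max(
    triangle_lower_bound(math.dist(q, p1), math.dist(o, p1)),
    triangle_lower_bound(math.dist(q, p2), math.dist(o, p2)),
)

pto_lb = ptolemaic_lower_bound(
    math.dist(q, p1), math.dist(q, p2),
    math.dist(o, p1), math.dist(o, p2),
    math.dist(p1, p2),
)

# An object can be pruned without computing d(q, o) whenever either
# lower bound already exceeds the current k-th nearest distance.
print(true_dist, tri_lb, pto_lb)
```

Only the object-to-reference distances need to be stored in advance; the query-to-reference distances are computed once per query, so each bound costs a few arithmetic operations per candidate.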

    Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science

    Large Language Models (LLMs) have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of real-world data researchers care about (in other words, it is unfaithful). In a case study on sarcasm detection, we study three strategies to increase the faithfulness of synthetic data: grounding, filtering, and taxonomy-based generation. We evaluate these strategies using the performance of classifiers trained with generated synthetic data on real-world data. While all three strategies improve the performance of classifiers, we find that grounding works best for the task at hand. As synthetic data generation plays an ever-increasing role in NLP research, we expect this work to be a stepping stone in improving its utility. We conclude this paper with some recommendations on how to generate high(er)-fidelity synthetic data for specific tasks. Comment: 8 page

    Etiologies and Predictors of 30-Day Readmission in Heart Failure: An Updated Analysis

    BACKGROUND AND OBJECTIVES: Readmissions in heart failure (HF), historically reported as 20%, contribute to significant patient morbidity and high financial cost to the healthcare system. The changing population landscape and risk factor dynamics mandate periodic epidemiologic reassessment of HF readmissions. METHODS: National Readmission Database (NRD, 2019) was used to identify HF-related hospitalizations and evaluated for demographic, admission characteristics, and comorbidity differences between patients readmitted vs. those not readmitted at 30-days. Causes of readmission and predictors of all-cause, HF-specific, and non-HF-related readmissions were analyzed. RESULTS: Of 48,971 HF patients, the readmitted cohort was younger (mean 67.4 vs. 68.9 years, p≤0.001), had higher proportion of males (56.3% vs. 53.7%), lowest income quartiles (33.3% vs. 28.9%), Charlson comorbidity index (CCI) ≥3 (61.7% vs. 52.8%), resource utilization including large bed-size hospitalizations, Medicaid enrollees, mean length of stay (6.2 vs. 5.4 days), and disposition to other facilities (23.9% vs. 20%) than non-readmitted. Readmission (30-day) rate was 21.2% (10,370) with cardiovascular causes in 50.3% (HF being the most common: 39%), and non-cardiac in 49.7%. Independent predictors for readmission were male sex, lower socioeconomic status, nonelective admissions, atrial fibrillation, chronic obstructive pulmonary disease, chronic kidney disease, anemia, and CCI ≥3. HF-specific readmissions were significantly associated with prior coronary artery disease and Medicaid enrollment. CONCLUSIONS: Our analysis revealed cardiac and noncardiac causes of readmission were equally common for 30-day readmissions in HF patients with HF itself being the most common etiology highlighting the importance of addressing the comorbidities, both cardiac and non-cardiac, to mitigate the risk of readmission

    Factors Associated with Revision Surgery after Internal Fixation of Hip Fractures

    Background: Femoral neck fractures are associated with high rates of revision surgery after management with internal fixation. Using data from the Fixation using Alternative Implants for the Treatment of Hip fractures (FAITH) trial evaluating methods of internal fixation in patients with femoral neck fractures, we investigated associations between baseline and surgical factors and the need for revision surgery to promote healing, relieve pain, treat infection or improve function over 24 months postsurgery. Additionally, we investigated factors associated with (1) hardware removal and (2) implant exchange from cancellous screws (CS) or sliding hip screw (SHS) to total hip arthroplasty, hemiarthroplasty, or another internal fixation device. Methods: We identified 15 potential factors a priori that may be associated with revision surgery, 7 with hardware removal, and 14 with implant exchange. We used multivariable Cox proportional hazards analyses in our investigation. Results: Factors associated with increased risk of revision surgery included: female sex, [hazard ratio (HR) 1.79, 95% confidence interval (CI) 1.25-2.50; P = 0.001], higher body mass index (fo

    Process Design, Optimization and Material Screening Methods for Small-Scale Chemical Manufacturing with Application to Unconventional Natural Gas

    To meet the increasing global needs for energy, chemicals and commodity products, there is a substantial push for utilizing unconventional feedstocks such as stranded natural gas, shale gas, biogas and landfill gas. However, unconventional feedstocks pose significant challenges for centralized processing due to variabilities in scale and availability. In addition, the geographical sparsity, low feedstock quality and time-varying supply of unconventional natural gas feedstocks render existing chemical facilities inefficient for their utilization. As a result, it is challenging for conventional stick-built plants to keep up with evolving product demands and feedstock availability. An alternative is to develop small-scale, modular and intensified processes which are better suited for handling challenges associated with unconventional feedstocks and can better accommodate dynamic market conditions, process variabilities and geographical sparsity. However, the capital intensity (i.e., cost per unit production) of small-scale plants is much higher compared to their large-scale and centralized counterparts. In this thesis, to counter the diseconomies of scaling, computational frameworks and methodologies are proposed for cost-effective development of small-scale technologies. The proposed methodologies are based on principles rooted in multi-scale process development, dynamic process intensification and equipment standardization where small-scale, modular and intensified equipment modules with optimal materials are designed and operated for distributed chemical manufacturing. The utility of the developed computational frameworks is demonstrated through several midstream and downstream case studies prevalent in unconventional natural gas supply chains

    GRADES-NDA 2019: Joint International Workshop on Graph Data Management Experiences & Systems and Network Data Analytics

    GRADES-NDA 2019 is the second joint meeting of the GRADES and NDA workshops, which were each independently organized at previous SIGMOD-PODS meetings, GRADES since 2013 and NDA since 2016. The focus of GRADES-NDA is the application areas, usage scenarios, and open challenges in managing large-scale graph-shaped data. To summarize, GRADES-NDA aims to present technical contributions inside graph, RDF, and other data management systems on massive graphs